Optimizing data gathering enables better analysis of HPC system metrics
Executing the scientific mission of NOAA's Geophysical Fluid Dynamics Laboratory (GFDL), which involves modeling, understanding, and then predicting the Earth's systems, is incredibly complex. The work relies on high performance computing and is managed by huge scripts that define input data, experiment parameters, diagnostics, compute resources, data transfers, short- and long-term storage, and more. Today's "snap together" cluster computing solutions aim to meet complex mission needs like NOAA's. With complexity comes cost, and reducing computing throughput time by just 10 percent can yield a substantial return on investment.
The complex hardware and systems software in these computing environments make it virtually impossible to understand the interactions between the various concurrently running jobs. Job interactions and system conditions can lead to throughput variations whose root causes are extremely difficult and time-consuming to identify. Current system monitoring focuses on the various hardware elements, with little or no connection to the jobs attempting to use those resources.
Ideally, system administrators, workflow engineers, and researchers could view the resources as each job element sees them, aggregate that view across the entire system, and apply modern analytics to the data to improve throughput.
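As a rough illustration of that job-centric view, the following sketch samples the resources one job element actually sees, using the third-party psutil package. The `job_id` tag is a hypothetical placeholder for whatever identifier a real workflow would attach; this is not the prototype's actual collector.

```python
# Minimal sketch of job-centric metric sampling (assumes psutil is installed).
# The job_id field is a hypothetical workflow tag, not part of any real schema.
import json
import os
import socket
import time

import psutil

def sample_job_metrics(job_id: str) -> dict:
    """Record what this job element sees, rather than whole-node hardware counters."""
    proc = psutil.Process(os.getpid())
    sample = {
        "job_id": job_id,
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_percent": proc.cpu_percent(interval=0.1),
        "rss_bytes": proc.memory_info().rss,
    }
    try:
        # Per-process I/O counters are not available on every platform (e.g., macOS).
        sample["io_read_bytes"] = proc.io_counters().read_bytes
    except (AttributeError, psutil.AccessDenied):
        sample["io_read_bytes"] = None
    return sample

if __name__ == "__main__":
    print(json.dumps(sample_job_metrics("example-job-001"), indent=2))
```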
NOAA, partnering with SAIC and Minimal Metrics, has introduced a prototype solution that addresses:
- Extensible and scalable data collection capabilities within the post-processing streams.
- Techniques to enhance criteria-based data selection.
- A flexible data model that can identify “outlier” operations within the job set and then drive root-cause analysis of them (see the sketch after this list).
- A capability for query execution against the data to visualize or otherwise represent throughput metrics, identify variability, and support root-cause analysis.
- Improvements to data-ingest and query processing time.
- An initial code base that accomplishes the above and sufficient documentation to maintain and extend the code.
- An initial web framework to support interactive exploration of the data in the database.
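To make the outlier-identification idea concrete, here is a minimal sketch of one common approach: flagging jobs whose runtime deviates sharply from that of their peers. The `job_metrics` table, its columns, and the z-score threshold are all assumptions for illustration, not the project's actual data model.

```python
# Sketch of runtime outlier detection over per-job metrics, assuming a
# hypothetical SQLite table job_metrics(job_id, job_name, elapsed_seconds).
import sqlite3
import statistics

def find_runtime_outliers(db_path: str, threshold: float = 3.0):
    """Flag jobs whose elapsed time deviates from their peer group's mean
    by more than `threshold` standard deviations."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT job_name, job_id, elapsed_seconds FROM job_metrics"
    ).fetchall()
    conn.close()

    # Group elapsed times by job name so each job is compared to its peers.
    by_name = {}
    for name, job_id, secs in rows:
        by_name.setdefault(name, []).append((job_id, secs))

    outliers = []
    for name, jobs in by_name.items():
        times = [secs for _, secs in jobs]
        if len(times) < 2:
            continue  # not enough samples to estimate spread
        mean = statistics.fmean(times)
        stdev = statistics.stdev(times)
        if stdev == 0:
            continue  # identical runtimes: nothing to flag
        for job_id, secs in jobs:
            z = (secs - mean) / stdev
            if abs(z) > threshold:
                outliers.append((name, job_id, secs, round(z, 2)))
    return outliers

if __name__ == "__main__":
    for name, job_id, secs, z in find_runtime_outliers("metrics.db"):
        print(f"{name} job {job_id}: {secs:.0f}s (z = {z})")
```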
Improving productivity and reducing time to results
The solution can execute complex, ad-hoc queries against the collected data and can associate a series of related jobs, as illustrated below. The resulting insights provide decision support for future HPC system enhancements.
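One plausible example of such an ad-hoc query groups the jobs belonging to a single experiment and reports its end-to-end wall time. The `experiment_id`, `start_time`, and `end_time` columns (assumed to hold epoch seconds) are illustrative stand-ins, not the prototype's actual schema.

```python
# Illustrative ad-hoc query associating the jobs of each experiment and
# computing end-to-end wall time; the job_metrics schema is hypothetical.
import sqlite3

QUERY = """
SELECT experiment_id,
       COUNT(*)                        AS job_count,
       MIN(start_time)                 AS first_start,
       MAX(end_time)                   AS last_end,
       MAX(end_time) - MIN(start_time) AS end_to_end_seconds
FROM job_metrics
GROUP BY experiment_id
ORDER BY end_to_end_seconds DESC;
"""

def experiment_throughput(db_path: str):
    """Return per-experiment throughput rows, slowest experiments first."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchall()
```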
The project benefits those who are responsible for running, maintaining, optimizing, and debugging HPC systems; those responsible for enhancing HPC system resources; and those responsible for estimating the resources needed to produce a particular piece of finished work (e.g., how much will it cost to run a specific set of models end-to-end through the system, and how and when will it be done?).